Communication of a message using a network interface controller on a subnet

ABSTRACT

Particular embodiments described herein provide for a system for enabling the communication of a message using a network interface controller (NIC) on a subnet. In an example, the system is applicable to hardware offload NICs such as those implementing the Portals protocol. The system can be configured to determine a NIC in a first subnet to be used to communicate a message, where the NIC is configured to comply with a message passing interface protocol, create a manifest that includes an identifier of the NIC and a subnet ID that identifies the first subnet, and communicate the manifest to a receiver. If it is determined that the NIC in the first subnet to be used to communicate the message is not operational (a non-operating NIC), the system can determine the network element that includes the non-operating NIC, determine an alternate NIC on a different subnet but on the same network element that includes the non-operating NIC, and communicate with the alternate NIC using the different subnet.

TECHNICAL FIELD

This disclosure relates in general to the field of computing and/or networking, and more particularly, to the communication of a message using a network interface controller on a subnet.

BACKGROUND

Emerging network trends in both data center and telecommunication networks place increasing performance demands on a system. Application performance depends on good use of the network and efficient use of the data traffic on the network. A network interface controller (NIC), also known as a network interface card, network adapter, LAN adapter, physical network interface, and other similar terms, is a computer hardware component that connects a computer to a network and provides applications with a dedicated, full-time connection to the network.

BRIEF DESCRIPTION OF THE DRAWINGS

To provide a more complete understanding of the present disclosure and features and advantages thereof, reference is made to the following description, taken in conjunction with the accompanying figures, wherein like reference numerals represent like parts, in which:

FIG. 1 is a simplified block diagram of a system to enable the communication of a message using a network interface controller (NIC) on a subnet, in accordance with an embodiment of the present disclosure;

FIGS. 2A-2C are simplified block diagrams of a portion of a system to enable the communication of a message using a NIC on a subnet, in accordance with an embodiment of the present disclosure;

FIG. 3 is a simplified block diagram of a NIC table illustrating example details to enable the communication of a message using a NIC on a subnet, in accordance with an embodiment of the present disclosure;

FIG. 4 is a simplified block diagram of a packet illustrating example details to enable the communication of a message using a NIC on a subnet, in accordance with an embodiment of the present disclosure;

FIG. 5 is a simplified block diagram of a portion of a packet illustrating example details to enable the communication of a message using a NIC on a subnet, in accordance with an embodiment of the present disclosure;

FIGS. 6A and 6B are simplified block diagrams of a portion of a packet illustrating example details to enable the communication of a message using a NIC on a subnet, in accordance with an embodiment of the present disclosure;

FIG. 7 is a simplified block diagram of a portion of a packet illustrating example details to enable the communication of a message using a NIC on a subnet, in accordance with an embodiment of the present disclosure;

FIG. 8 is a simplified flowchart illustrating potential operations that may be associated with the system in accordance with an embodiment;

FIG. 9 is a simplified flowchart illustrating potential operations that may be associated with the system in accordance with an embodiment;

FIG. 10 is a simplified flowchart illustrating potential operations that may be associated with the system in accordance with an embodiment; and

FIG. 11 is a simplified flowchart illustrating potential operations that may be associated with the system in accordance with an embodiment.

The FIGURES of the drawings are not necessarily drawn to scale, as their dimensions can be varied considerably without departing from the scope of the present disclosure.

DETAILED DESCRIPTION

Example Embodiments

The following detailed description sets forth examples of apparatuses, methods, and systems relating to a system for enabling the communication of a message using a network interface controller (NIC) on a subnet. In an example, the communication may be of a large message using a NIC on a subnet. The term “large message” refers to a message larger than a network maximum transmission unit (MTU). The MTU is the size of the largest network layer protocol data unit that can be communicated in a single network transaction. Features such as structure(s), function(s), and/or characteristic(s), for example, are described with reference to one embodiment as a matter of convenience; various embodiments may be implemented with any suitable one or more of the described features.

In the following description, various aspects of the illustrative implementations will be described using terms commonly employed by those skilled in the art to convey the substance of their work to others skilled in the art. However, it will be apparent to those skilled in the art that the embodiments disclosed herein may be practiced with only some of the described aspects. For purposes of explanation, specific numbers, materials and configurations are set forth in order to provide a thorough understanding of the illustrative implementations. However, it will be apparent to one skilled in the art that the embodiments disclosed herein may be practiced without the specific details. In other instances, well-known features are omitted or simplified in order not to obscure the illustrative implementations.

In the following detailed description, reference is made to the accompanying drawings that form a part hereof wherein like numerals designate like parts throughout, and in which is shown, by way of illustration, embodiments that may be practiced. It is to be understood that other embodiments may be utilized and structural or logical changes may be made without departing from the scope of the present disclosure. Therefore, the following detailed description is not to be taken in a limiting sense. For the purposes of the present disclosure, the phrase “A and/or B” means (A), (B), or (A and B). For the purposes of the present disclosure, the phrase “A, B, and/or C” means (A), (B), (C), (A and B), (A and C), (B and C), or (A, B, and C).

FIG. 1 is a simplified block diagram of a system 100 to enable communication of a message using a NIC on a subnet, in accordance with an embodiment of the present disclosure. System 100 can include one or more network elements 102 a-102 f. Each of network elements 102 a-102 f can include memory, one or more applications, a communication engine, and two or more NICs. For example, network element 102 a can include memory 104, an application 106 a, a communication engine 108, and a plurality of NICs 110 a-110 c. Network element 102 b can include memory 104, an application 106 b, a communication engine 108, and a plurality of NICs 110 d-110 f. Network element 102 c can include memory 104, an application 106 c, a communication engine 108, and a plurality of NICs 110 g-110 i. Each application 106 a-106 c may be a virtual network function (VNF). Each NIC 110 a-110 h may be a high-speed NIC such as a high-speed host fabric interface (HFI).

Each of network elements 102 a-102 f, and more specifically one or more NICs in each of network elements 102 a-102 f, may be in communication with each other using a sub-network (subnet). For example, NIC 110 c in network element 102 a and NIC 110 d in network element 102 b may be in communication using subnet 112 a. NIC 110 f in network element 102 b and NIC 110 g in network element 102 c may be in communication using subnet 112 b. NIC 110 a and NIC 110 b in network element 102 a, NIC 110 e in network element 102 b, and NIC 110 h in network element 102 c may be in communication using subnet 112 c. Network elements 102 a-102 c may also be in communication with each other using network 114. Each of subnets 112 a-112 c is a logical subdivision of network 114. In an example, network 114 may be part of a datacenter. In an example, one or more electronic devices 116 and open network 118 may be in communication with network 114. Each of NICs 110 a-110 i is configured to comply with a message passing interface protocol.

It is to be understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the present disclosure. For example, each network element may have one or more applications, one or more regions of memory, and/or one or more NICs. Substantial flexibility is provided by system 100 in that any suitable arrangements and configuration may be provided without departing from the teachings of the present disclosure.

System 100 can be configured to enable a recovery scheme making the system resilient to NIC failures. In addition to system configurations consisting of dual planes, the system may even be used when two network elements are connected by multiple NICs, possibly on different subnets, as long as there are multiple NICs and paths through the network between the two network elements. In an example, system 100 can be configured to include a subnet ID when communicating messages across network 114. The subnet ID (e.g., 8 bits) identifies the subnet that includes the NIC that communicated the message and allows the system to determine the communication path needed to communicate with NICs across the system. If a first NIC on a first network element is in a first subnet and fails, then the communications to the first NIC can be rerouted to a second NIC on the first network element in a second subnet. In an example, the system can be enabled to determine a NIC in a first subnet to be used to communicate a message using the first subnet, create a manifest that includes an identifier of the NIC and a subnet ID that identifies the first subnet, and communicate the manifest to the receiver. The term “receiver” includes the specific NIC that will receive the manifest. For example, if NIC 110 a was to communicate the manifest to NIC 110 e, then NIC 110 e would be the receiver. The subnet ID can be added to a NIC table and the NIC table can include an identification of a plurality of NICs, a subnet associated with each of the plurality of NICs, and an identification of a specific network element that includes a specific NIC. The NIC table can also include an indicator when the specific NIC is active on a specific subnet that is associated with the specific NIC. When it is determined that one of the plurality of NICs in the first subnet to be used to communicate the message is not operational (a non-operating NIC), the system can determine the network element that includes the non-operating NIC, determine an alternate NIC on a different subnet but on the same network element that includes the non-operating NIC, and communicate with the alternate NIC using the different subnet. The communication with the alternate NIC can include a request to disable the non-operating NIC.
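The NIC table lookup described above can be illustrated with a minimal sketch. The names NicEntry and find_alternate_nic, the string identifiers, and the numeric subnet IDs are hypothetical and chosen only for illustration; the disclosure does not prescribe a particular data layout.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NicEntry:
    """One row of a hypothetical NIC table."""
    nic_id: str        # identification of the NIC, e.g. "110d"
    subnet_id: int     # subnet ID; assumed here to fit in 8 bits
    element_id: str    # network element that includes the NIC
    active: bool = True  # indicator that the NIC is active on its subnet

    def __post_init__(self) -> None:
        if not 0 <= self.subnet_id <= 0xFF:
            raise ValueError("subnet_id must fit in 8 bits")


def find_alternate_nic(table: List[NicEntry], failed_nic_id: str) -> Optional[NicEntry]:
    """Return an active NIC on the same network element as the failed NIC
    but on a different subnet, or None if no such NIC is in the table."""
    failed = next((e for e in table if e.nic_id == failed_nic_id), None)
    if failed is None:
        return None
    for entry in table:
        if (entry.nic_id != failed.nic_id
                and entry.element_id == failed.element_id
                and entry.subnet_id != failed.subnet_id
                and entry.active):
            return entry
    return None
```

In this sketch, if the entry for NIC 110 d is marked inactive, the lookup returns another NIC on network element 102 b that sits on a different subnet, which is the NIC the system would then contact.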

Elements of FIG. 1 may be coupled to one another through one or more interfaces employing any suitable connections (wired or wireless), which provide viable pathways for network (e.g., network 114, etc.) communications. Additionally, any one or more of these elements of FIG. 1 may be combined or removed from the architecture based on particular configuration needs. System 100 may include a configuration capable of transmission control protocol/Internet protocol (TCP/IP) communications for the transmission or reception of packets in a network. System 100 may also operate in conjunction with a user datagram protocol/IP (UDP/IP) or any other suitable protocol where appropriate and based on particular needs.

For purposes of illustrating certain example techniques of system 100, it is important to understand the communications that may be traversing the network environment. The following foundational information may be viewed as a basis from which the present disclosure may be properly explained.

Message passing interface (MPI) has become the de-facto programming model used in high performance computing (HPC). MPI provides both point-to-point as well as collective communications capabilities. The software stack for some NICs can provide a low-level communication application program interface (API) called the open fabric interface (OFI) that takes advantage of the off-load capabilities of hardware to implement efficient point-to-point and collective communications.

Tag match bits are part of the current MPI protocol. Rather than simply streaming data like TCP, MPI defines messages as a finite payload plus an envelope (which contains tag match bits). At the receiver, incoming messages are only delivered to memory regions that are associated with tag match bits found in the message's envelope. Messages communicated from one device to another device include tag match bits that the receiver can use to identify messages and determine where the message needs to be stored. In the current MPI protocol, multiple messages can have the same tag match bit value. If there are multiple messages from a single initiator and multiple receives at the receiver with the same tag match bits, the messages are matched in the order they are sent, per the MPI tag matching rule, even if the messages arrive out of order (e.g., due to different network paths or stale packets).

Recent trends in High Performance Computing (HPC) systems have demonstrated that a move to Peta and Exa-scale capable machines is possible not only because of improvements in the capabilities of the individual nodes or network elements and the interconnects, but also because of an increase in the scale of the system size using a significantly large number of nodes. However, several challenges arise as the size of the system starts to grow. One challenge relates to efficiency, as efficiency concerns demand low-overhead communication protocols that can be completely offloaded to hardware. While most modern RDMA capable devices provide such offload, some devices support capabilities to enable efficient implementations of MPI and PGAS programming models. This includes support for efficient tag matching, collectives, and atomics that are expected to be used heavily by software and middleware components. Another challenge relates to application performance, a large part of which depends on the communication patterns of the applications that run on such systems and at least partially dictate the choice of topology. Topologies such as fat-trees provide good fault-tolerance through path-diversity, provide full bisection bandwidth for arbitrary permutations of communications, and require simpler routing algorithms for deadlock free routing. However, they offer no adaptive routing features that allow the fabric to react to congestion in the network, choose alternative paths, and tolerate packets that arrive out-of-order as a result. Another challenge relates to cost, and a big factor of cost is the cost of the infrastructure needed for such large systems (e.g., the cost of expensive cables). Topologies such as Dragonfly, which aim to improve overall performance, reduce system cost, and provide features such as adaptive routing, have become popular at the expense of more complicated routing requirements.

As systems get bigger, their mean time between failure (MTBF) decreases significantly and such systems must be architected to ensure that they are resilient to failures. Some large systems use a Dragonfly topology and are built using “dual-plane” configurations where each node is connected to two identical but redundant fabric planes (subnets) via two separate NICs. While the Dragonfly topology offers enough redundancy in a single fabric plane (e.g., alternate routes between a pair of nodes in cases of link or switch failures), there are “weak links” in the system that lower the overall resiliency of the system. Two problematic weak links are the NIC itself and the link between the NIC and its neighboring switch in the group (the first level switch).

Some current systems address the need for fault-tolerant schemes to improve the overall resiliency of a large system. One system describes a light-weight protocol for message passing software that is able to recover from network failures such as switches and cables and end-node network devices themselves. However, such a protocol relies on the use of RC transport queue pairs (QPs) for reliability and error state/notification. The system also builds a higher level (MPI) protocol on top of what the IB HCA provides to work around some of the problems identified and also uses additional memory for buffering that limits scalability. In addition, given the increase in scale of HPC systems, recent advancements in MPI have shifted focus away from RC QPs, which require per-peer connection state (memory footprint) that limits scalability, while UD QPs offer a connectionless model but lack the reliability and error notification needed by the protocol. Another current system describes a scheme that provides resilience using multiple rails but is limited to MPI over IB. In addition, none of the current systems address the issue of stale packets.

A system for communication of a message using a network interface controller on a subnet, as outlined in FIG. 1, can resolve these issues (and others). MPI is the de-facto communication model for high-performance computing (HPC) and provides the ability to send and receive messages between processes and nodes. MPI provides both point-to-point as well as collective communications capabilities. The software stack for some NICs can provide a low-level communication application program interface (API) called the open fabric interface (OFI) that takes advantage of the off-load capabilities of hardware to implement efficient point-to-point and collective communications. System 100 can be configured to communicate a large message across multiple NICs and still maintain message/tag match ordering and not violate the MPI tag matching rule that is part of the MPI protocol. In system 100, each NIC is configured to comply with the MPI protocol.

The MPI tag matching rule is commonly understood as an ordering guarantee of certain properties of point-to-point semantics. The MPI tag matching rule is a non-overtaking rule that includes a requirement that facilitates the matching of sends to receives and guarantees that message-passing is deterministic. In cases where a single NIC and a single remote NIC are used to send and receive messages, tag matching for small and large messages can be relatively easily maintained because all messages are sent in order, initiated from a single NIC, and are destined to a single remote NIC.

When sending large messages, a GET command may be used. The GET command is part of the current MPI protocol. In a GET operation (e.g., an operation that uses the GET command), a receiver communicates a message to an initiator (in this example the initiator is commonly referred to as a target since it is the “target” of the GET command) requesting data from the initiator. The receiver can use the GET command to pull the data from the initiator's memory region that includes the data. One of the parameters of a GET command is an offset from which to start pulling data. For example, to pull a message with length “N” bytes, a first GET command can be issued with an offset of 0 and length N/2 and a second GET command can be issued with an offset of N/2 and a length of N/2. The offset is included in the header portion of the message that is part of the GET command. Another parameter included in the header portion of the message is the message identifier (ID). Also, in a PUT command from the initiator to the receiver, the message ID will be included in the header. The message ID is a unique value that distinguishes one message from all others.
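The offset and length arithmetic of the GET example above generalizes to any number of chunks. The following is a minimal sketch; the function name split_into_gets is hypothetical, and handing the remainder to the leading chunks is an assumption made so that uneven lengths are still fully covered.

```python
from typing import List, Tuple


def split_into_gets(length: int, num_chunks: int) -> List[Tuple[int, int]]:
    """Return (offset, length) pairs that together cover a message of
    `length` bytes using at most `num_chunks` GET commands."""
    if length < 0 or num_chunks < 1:
        raise ValueError("length must be >= 0 and num_chunks >= 1")
    base, extra = divmod(length, num_chunks)
    gets = []
    offset = 0
    for i in range(num_chunks):
        # The first `extra` chunks carry one additional byte each.
        size = base + (1 if i < extra else 0)
        if size:
            gets.append((offset, size))
        offset += size
    return gets
```

For a message of N = 1024 bytes and two chunks, this yields an offset of 0 with length 512 and an offset of 512 with length 512, matching the N/2 example in the text.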

Tag match bits are also part of the current MPI protocol. Rather than simply streaming data like TCP, MPI defines messages as a finite payload plus an envelope (which contains tag match bits). At the receiver, incoming messages are only delivered to memory regions that are associated with tag match bits found in the message's envelope. Messages communicated from one device to another device include tag match bits that the receiver can use to identify messages and determine where the message needs to be stored. In the current MPI protocol, multiple messages can have the same tag match bit value. If there are multiple messages from a single initiator and multiple receives at the receiver with the same tag match bits, the messages are matched in the order they are sent, per the MPI tag matching rule, even if the messages arrive out of order (e.g., due to different network paths). When the messages arrive and are matched out of order, errors can occur.

In an example, system 100 can be configured to send a relatively small message from an initiator on a chosen primary NIC that provides a receiver with a list of initiator NICs as well as transport protocol specific information. The primary NIC is a NIC that is used for all messages (independent of size) for a pair of MPI processes. The receiver can then use the list of initiator NICs as targets to communicate a plurality of GET commands and pull a large message from the initiator's memory when the receiver is ready to pull the large message. System 100 can be configured such that a GET command from the receiver to the initiator includes a message ID as the tag match bits. The message ID identifies the exposed region of memory on the initiator where the data for the large message is located. The receiver can communicate the message ID to the initiator so the initiator knows what data and region of memory is associated with the GET command. In addition, the receiver can choose a set of NICs on its node to initiate the GET commands.

In order to distinguish an application initiated small message from the small message that includes the list of initiator NICs, the message that includes the list of initiator NICs utilizes an existing header data field but encodes the header data field with information that will allow the receiver to recognize that the initiator was trying to send a large message using multiple NICs. The envelope portion of the small message that includes the list of initiator NICs can include a message ID (to be used as tag match bits) for the message. The message ID can be used by the receiver to identify the exposed region of memory on the initiator where the data for the message is located. The initiator's response to the GET command will also include the message ID and the receiver can use the message ID to determine where to store each chunk of the large message as it arrives at the receiver in the receiver's exposed regions of memory and reconstruct the chunks of the large message.

The list of initiator NICs can be included in the payload portion of the message as a manifest. Once the initial small message that includes the manifest is delivered to the receiver from the initiator, the receiver can analyze the manifest and create a strategy to pull the full large message across the network using multiple GET commands. For example, the receiver can initiate a pull of the entire large message using one or more commands that will cause the communication of the large message to benefit from topologies that support other efficiencies such as adaptive routing. This allows the chunks of the large message to be delivered in any order and the manifest provides the receiver with enough information to pull the entire large message once the tag matching is successfully completed. The MPI tag matching rule is not violated because the initial small message includes the message ID (to be used as tag matching bits) necessary to ensure the chunks are sent to the correct memory region in the receiver and allow the chunks to be reconstructed into the original large message.

In MPI, a message consists of a data payload plus an envelope containing metadata. The envelope of each message can include a source, a tag, and a communicator. The source, tag, and communicator are used to match a send operation to a particular receive operation. Receive operations are posted to a pending receive queue where they await incoming messages. As each message arrives, its envelope is compared to each of the queued receives and the message is delivered (copied) into the memory area allocated by the first matching receive.
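The pending receive queue described above can be sketched as follows. This is an illustrative model only: the class names Envelope and ReceiveQueue are hypothetical, matching is by exact equality of source, tag, and communicator, and wildcard receives are left out for brevity.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Envelope:
    source: int
    tag: int
    communicator: int


class ReceiveQueue:
    """Hypothetical pending receive queue with first-match delivery."""

    def __init__(self) -> None:
        self._pending = deque()  # (envelope, buffer) in posting order

    def post_receive(self, envelope: Envelope, buffer: bytearray) -> None:
        self._pending.append((envelope, buffer))

    def deliver(self, envelope: Envelope, payload: bytes) -> bool:
        """Copy the payload into the buffer of the first matching posted
        receive and retire that receive. Returns False if nothing matches.
        The buffer is assumed to be at least as large as the payload."""
        for index, (posted, buffer) in enumerate(self._pending):
            if posted == envelope:
                buffer[:len(payload)] = payload
                del self._pending[index]
                return True
        return False
```

Because the queue is scanned in posting order, two receives posted with the same envelope are satisfied in the order they were posted, which is the behavior the tag matching rule relies on.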

When presented with a message to send, the initiator first decides if it must employ a multi-unit message scheme (herein referred to as “MU”). If the message is smaller than the MTU, the message can be sent using a standard PUT command because the entire message fits in a single packet and can be delivered to the receiver with a single message. In a current standard PUT command (e.g., MPI PUT command), the header portion of the PUT command includes a message ID. The MTU is the size of the largest network layer protocol data unit that can be communicated in a single network transaction. Fixed MTU parameters usually appear in association with a communications interface or standard. If the message is above a threshold, the send is initiated by sending a relatively small amount of data (e.g., less than or equal to 64 bytes) in the payload portion of the message that describes a list of NICs that the initiator plans to use for the communication of the large message. The threshold can include messages that are above the MTU. In an example, an administrator may set the threshold as a message size that is larger than the MTU based on system performance, bandwidth, overall cost, or other considerations. In some systems, if the message is larger than the network MTU, the message may be sent using a single NIC or across multiple NICs as messages that are slightly larger than the MTU may not benefit from MU.
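The initiator's send-path decision can be sketched as a small function. The function name, the returned labels, and the treatment of a message exactly equal to the MTU as fitting in a single packet are assumptions made for illustration; the threshold is modeled as an administrator-set value that is at least the MTU.

```python
def choose_send_path(message_len: int, mtu: int, mu_threshold: int) -> str:
    """Pick how the initiator sends a message.

    Returns "PUT" for a message that fits in one packet, "MU" for a
    message above the multi-unit threshold, and "SINGLE_NIC" for a
    message larger than the MTU that is not large enough to benefit
    from MU (hypothetical labels)."""
    if mu_threshold < mtu:
        raise ValueError("mu_threshold is assumed to be >= mtu")
    if message_len <= mtu:
        return "PUT"
    if message_len > mu_threshold:
        return "MU"
    return "SINGLE_NIC"
```

With an MTU of 4096 bytes and a threshold of 65536 bytes, a 100-byte message takes the PUT path, a 5000-byte message stays on a single NIC, and a 100000-byte message is sent using MU.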

The relatively small amount of data in the payload portion is called the manifest. The manifest can include a list of NICs on the initiator that the receiver can use to obtain the large message. The list of NICs may include all NICs on the initiator that can be employed to reach the destination on the receiver. In some cases, only a subset of the NICs available on the initiator might be part of this list. Note that the initiator might choose to prune the list of all the NICs that are available to reduce the overhead associated with exposing regions of memory to multiple NICs, reserve NICs for other traffic, and/or only provide NICs that are accessible from the receiver. The initiator can expose the allocated region of memory that includes the data for the large message to each NIC in the manifest that the initiator plans to use for communicating the large message. The receiver is free to choose from any of these NICs to pull any part or chunk of the large message.

In an illustrative example, the initiator can determine a plurality of local NICs that are able to communicate the large message and the address of each NIC. The initiator exposes the memory region that includes the data for the large message to the plurality of NICs, builds the manifest containing the addresses of all the plurality of NICs that will be used to communicate the large message to the receiver, inserts a message ID in an envelope portion of an MU message, and communicates the MU message that includes the manifest and message ID to the receiver instead of the large message. The header portion of the MU message includes a toggle flag or bit that indicates the initiator intends to use MU. The term “header data” includes header information (e.g., a source identifier, a destination identifier, type, etc.). If the MU bit is set, then the receiver knows that the message data is not application data but includes the manifest (the list of addressing information for the initiator's NICs). The receiver knows what NICs it has and, based on the manifest, the receiver knows what NICs the initiator has available for the receiver to request the large message using GET commands. The receiver can use GET commands to pull the large message from the memory region exposed to each NIC. For example, to pull a message with length “N” bytes in two chunks, one NIC will issue a GET command with an offset of 0 and length N/2 and the other NIC will issue a GET command with an offset of N/2 and also a length of N/2. The GET command to the initiator can also include a unique message ID in the envelope portion of the GET command. The message ID is a unique tag match value that the initiator associates with the memory region (that includes the data for the large message) that has been exposed to each NIC listed in the manifest. To generate the message ID, a simple counter can be read and incremented for each message or it can be derived from a hardware resource uniquely associated with the send operation (e.g., the NIC's match entry handle for the exposed memory region). A key requirement is that the message ID is a unique value that distinguishes the large message from all other messages. When the receiver issues GET commands to pull message data, it uses the message ID (in the envelope portion of the MU message sent by the initiator) as the tag match bits to ensure the correct data is requested and received.
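The exchange described above can be modeled in a few lines. This is a simplified sketch: the Initiator class, its method names, and the dictionary used for the MU message are hypothetical, the message ID comes from the simple counter option mentioned in the text, and "exposing" memory is modeled as storing the data under its message ID.

```python
import itertools
from typing import Dict, List


class Initiator:
    """Hypothetical initiator that answers GET commands by message ID."""

    def __init__(self, nic_addresses: List[str]) -> None:
        self.nic_addresses = list(nic_addresses)
        self._next_id = itertools.count(1)    # simple per-message counter
        self._exposed: Dict[int, bytes] = {}  # message ID -> exposed region

    def send_large(self, data: bytes) -> dict:
        """Expose the data and return the small MU message sent in its place."""
        message_id = next(self._next_id)
        self._exposed[message_id] = data
        return {
            "mu": True,                            # MU flag in the header data
            "message_id": message_id,              # carried in the envelope
            "manifest": list(self.nic_addresses),  # carried in the payload
        }

    def serve_get(self, message_id: int, offset: int, length: int) -> bytes:
        """Return the requested chunk of the region tagged with message_id."""
        return self._exposed[message_id][offset:offset + length]
```

A receiver holding the MU message can issue one GET per chunk, in any order and toward any NIC in the manifest, and place each returned chunk at its offset; the message ID ties every GET to the correct exposed region.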

System 100 can be configured to utilize a fine-grained scheme where a single pair of MPI ranks can use multiple NICs (both on the source and destination) to transfer a single large message to a destination MPI rank. This can enable a software protocol on top of an RDMA capable device that offloads not only a transport protocol but can also handle tag matching and not violate the tag matching rule. In system 100, all messages, small and large, are initiated from a primary NIC. This might seem limiting at first but the limitation may be eliminated, or mitigated, by observing that the MPI tag matching rule only applies to a pair of MPI endpoints. This means that the primary NIC used for tag matching may be different for different pairs of MPI processes (e.g., the MPI rank ID may be used as a hashing function to compute which of the local NICs should be used to send the initial message that includes the manifest and message ID).
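One way to realize the rank-based selection of a primary NIC is a modulo over the list of local NICs. The function below is a sketch under that assumption; the text only states that the MPI rank ID may be used as a hashing function, so the specific hash is illustrative.

```python
from typing import List


def primary_nic(local_nics: List[str], peer_rank: int) -> str:
    """Pick the primary NIC for traffic to a given peer MPI rank.
    The same peer always maps to the same NIC, which keeps all initial
    messages for that pair of processes on one NIC, while different
    peers are spread across the local NICs."""
    if not local_nics:
        raise ValueError("at least one local NIC is required")
    return local_nics[peer_rank % len(local_nics)]
```

With three local NICs, peers with ranks 0, 1, and 2 map to three different primary NICs, and rank 4 maps to the same NIC as rank 1.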

In an example, when a receiver receives a message, the receiver examines the header data to see whether the message is an MU message and MU is in use. If the message sent by the initiator was not an MU message (e.g., the message was small or below the threshold) then the receiver accepts the message as a standard PUT command because the entire message fits in a single message packet and is delivered to the receiver. If the message is large (e.g., the message is larger than an MTU) but MU is not enabled, then the message may be broken up into chunks and each chunk can be sent using an RTS/CTS mechanism to transfer each chunk. This could also happen if the initiator is a legacy initiator that does not support MU. If the MU fields in the header data indicate that the initiator chose not to use MU, an expected receive operation is performed. If the header data is present and the MU indicator is set, indicating that the payload in the message includes the manifest, the receiver then tries to interpret the manifest to build a list of NICs to use to pull the large message from the initiator.

System 100 can be configured to benefit from the use of multiple NICs and paths to transfer a large message. In large systems where compute nodes are configured with resilient topologies such as Dragonfly and multiple subnets, such protocols increase the bandwidth of message transfer proportional to the number of NICs on each node. This is especially useful when each NIC is connected to a different subnet because the application is able to utilize both planes to transfer large messages. With some basic elements in the NIC hardware, fabric management, and a communication library (e.g., the runtime software) running on the node, the network may be extended to provide resiliency in the case of failures.

In an example with a first plane and a second plane, where a message (broken into packets) is sent between a pair of processes running on {group_0, switch_m-1, node_0} and {group_k, switch_0, node_1} respectively, if the first plane was being used (HFI 0 on each node) and there was a link or switch disruption in the first plane, the routing tables in the switches provide alternate routes to deliver the message to the destination, due to adaptive routing schemes (e.g., adaptive routing schemes are available in topologies such as DragonFly and other topologies). Even without disruptions in the fabric, adaptive routing can improve performance by using alternate non-minimal routes to the destination and work around congestion in the network. However, this assumes that packets of the message are able to ingress into the fabric, meaning that the NIC that was used to send the message and the “local” link that connects the NIC port to the first level switch are still functional. If they are not still functional, then packets can be lost.

System 100 can be configured such that upon detection of a message failure (or part of a message failure), communication engine 108 can utilize the information included in the manifest to find alternate NICs and replay such messages. Since the use of RDMA that is offloaded to hardware allows incoming data to be placed directly in the final application buffer without kernel involvement, a key concern with replay of a message is that of stale packets that arrive much later. As an example, a message is replayed on a different NIC following a failure. After a message is marked complete (completion event) and ownership of the buffer associated with the NIC is handed off to the application, packets that were injected from the original NIC pair may still be in the network and can land on the buffer, resulting in data corruption (e.g., if the application has used the data and repurposed the buffer for another message). The manifest can provide enough information for software to disable the appropriate entries in the NIC and protect against such cases.

In a simplified version of a network topology, switches in a subnet are connected to one or more network elements through NICs. Each switch is connected to every other switch in the subnet through a set of links (e.g., local group all-to-all). Links also connect the local switches to switches in other groups. In an example, a message (broken into packets) is sent between a pair of applications (e.g., applications 106 a and 106 b). If a first subnet was being used and there was a link or switch disruption in the first subnet, the routing tables in the switches of the subnet would provide alternate routes to deliver the message to the destination, thanks to adaptive routing schemes available in topologies such as DragonFly. Even without disruptions in the subnet, adaptive routing can improve performance by using alternate non-minimal routes to the destination and work around congestion in the network. However, all of this assumes that packets of the message are able to ingress into the subnet, meaning that the NIC they were sent on and the local link that connects the NIC to the first level switch are still functional.

System 100 can be configured to stripe a large message across multiple NICs in order to proportionally increase the bandwidth of the communication between two applications or processes. In the case of a dual-plane configuration, such a scheme utilizes both planes because each node is connected using two NICs that are in turn part of two separate but identical planes or subnets. However, there does not exist a mechanism or protocol to provide a reliable way to transfer a message in case of failures on one of the planes (e.g., a link or switch). The “weak links” described above are especially important given that a failure of a NIC or a local link can cause an overall application failure unless software is able to fail over messages (replay them) on the other functioning NIC.

In an example, system 100 can be configured to split large messages into segments that are sent across multiple NICs. In addition to providing performance benefits by utilizing multiple NICs between a pair of nodes, the system can be configured to provide fault tolerance in cases of failures in the network. For example, with the right information, communication engine 108 can replay failed messages on other available resources. This of course assumes that the initial tag matching always happens consistently on a set of primary NICs between two peer nodes.
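To make the segment splitting concrete, the following sketch divides a message of a given length into contiguous segments, one per NIC. The function name `stripe` and the even-split policy are assumptions for illustration; the disclosure does not prescribe how segment sizes are chosen.

```python
def stripe(message_len: int, nic_ids: list) -> list:
    """Return (nic_id, offset, length) segments that cover [0, message_len).

    Hypothetical helper: splits as evenly as possible, giving any
    remainder bytes to the first NICs in the list.
    """
    if not nic_ids:
        raise ValueError("at least one NIC is required")
    base, extra = divmod(message_len, len(nic_ids))
    segments = []
    offset = 0
    for i, nic in enumerate(nic_ids):
        length = base + (1 if i < extra else 0)
        if length:
            segments.append((nic, offset, length))
        offset += length
    return segments


for seg in stripe(10, ["nic0", "nic1", "nic2"]):
    print(seg)
```

If a segment later fails, the same (offset, length) pair can be reissued on another NIC without touching the segments that already completed.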

The broad requirements to build a resilient scheme include a hardware based error detection and reporting scheme. This can include a simple hardware based event delivery mechanism on the initiator and the target that indicates success or failure along with a reason code. The broad requirements also include a well-defined interface between hardware and software (e.g., event queues and asynchronous error reporting) that allows software to configure, detect, and react to errors. The broad requirements also include use of multiple NICs to send and receive messages. In case of failures, messages that failed can be replayed on the functional NICs from the available list.

In an example, system 100 can be configured to maintain correctness for hardware offload NICs. More specifically, system 100 is applicable to hardware offload NICs such as those implementing the Portals protocol or a similar protocol. The Portals protocol is based on the concept of elementary building blocks that can be combined to support a wide variety of upper-level protocols and network transport semantics. In an example, system 100 can be configured to provide a generic scheme that uses some of the concepts of the Portals protocol and provides an API that can allow scalable, relatively high-performance network communication between nodes of a parallel computing system and attempts to provide a cohesive set of building blocks with which a wide variety of upper layer protocols (such as MPI, SHMEM, or UPC) may be built, while maintaining relatively high performance and scalability.

Upon detection of a failure (e.g., a message failed after several retries), hardware (e.g., NIC 110 a) can report failures to a communication engine (e.g., communication engine 108) via an event queue like interface. Each event can provide a reason code as well as a reference to the message that failed. Such events allow the communication engine to decide when to replay messages on alternative resources. For example, if the reason code indicates that the message failed because the packets contained an invalid destination, there is no point retrying the message on a different NIC because it will result in the same outcome. However, if the reason code indicates that a hardware timeout resulted in the failure (e.g., an acknowledgement message (ACK) was not received from the target because of a remote NIC failure or link disruption), the communication engine can evaluate other possible options and potentially replay a failed message on a different NIC on the network element. It must be noted that such a mechanism relies on the delivery of an event on the local NIC on which the original message is sent. In cases where the local NIC itself has failed, a mechanism can exist by which the communication engine is informed of a “fatal event” (e.g., communication to the communication engine), allowing the communication engine to replay all outstanding messages over other NICs in the network element (not knowing what parts of each message are complete).
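The reason-code handling described above might be sketched as follows. The enum members, the `FailureEvent` record, and the `decide` function are hypothetical names introduced for this sketch; the actual reason codes and event layout are hardware specific and are not given in the disclosure.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Reason(Enum):
    INVALID_DESTINATION = auto()  # replay elsewhere would fail the same way
    HARDWARE_TIMEOUT = auto()     # e.g., no ACK due to remote NIC or link failure
    LOCAL_NIC_FATAL = auto()      # the local NIC itself has failed


class Action(Enum):
    GIVE_UP = auto()
    REPLAY_MESSAGE = auto()
    REPLAY_ALL_OUTSTANDING = auto()


@dataclass(frozen=True)
class FailureEvent:
    reason: Reason
    message_id: int
    nic_id: str


def decide(event: FailureEvent, healthy_nics: list):
    """Return (action, alternate_nic) for a failure event (illustrative policy)."""
    alternates = [n for n in healthy_nics if n != event.nic_id]
    if event.reason is Reason.INVALID_DESTINATION or not alternates:
        return Action.GIVE_UP, None
    if event.reason is Reason.LOCAL_NIC_FATAL:
        # Progress of each message is unknown, so everything is replayed.
        return Action.REPLAY_ALL_OUTSTANDING, alternates[0]
    return Action.REPLAY_MESSAGE, alternates[0]


print(decide(FailureEvent(Reason.HARDWARE_TIMEOUT, 7, "110a"), ["110a", "110b"]))
```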

The communication engine can handle potential stale packets in the network by performing a “flush” operation. In addition, the communication engine can enumerate failed NICs in a NIC table and eliminate them from being advertised as an available resource. It must be noted that a NIC might be eliminated globally (for all targets) or for only specific targets, depending on the failure. For instance, if the failure to send a message was due to a local (level 0) link failure, this impacts sending messages to any target and therefore is a global failure. However, if the failure to send a message to a target was due to a target side link failure, the NIC that was used can still be used to send messages to other targets.
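The distinction between global and per-target elimination can be illustrated with a small bookkeeping class. `NicAvailability` and its method names are assumptions for this sketch and do not correspond to any structure named in the disclosure.

```python
class NicAvailability:
    """Track which local NICs may be advertised to which targets (illustrative)."""

    def __init__(self, nic_ids):
        self._nics = set(nic_ids)
        self._global_down = set()   # NICs unusable for every target
        self._target_down = {}      # nic_id -> set of targets it cannot reach

    def mark_local_link_failure(self, nic_id):
        # A local (level 0) link failure affects all targets.
        self._global_down.add(nic_id)

    def mark_target_side_failure(self, nic_id, target):
        # A target side failure only affects that one target.
        self._target_down.setdefault(nic_id, set()).add(target)

    def advertised_for(self, target):
        return sorted(
            n for n in self._nics
            if n not in self._global_down
            and target not in self._target_down.get(n, set())
        )


avail = NicAvailability(["110a", "110b"])
avail.mark_target_side_failure("110a", "102c")
print(avail.advertised_for("102c"))
print(avail.advertised_for("102b"))
```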

In a specific example, a message (e.g., the reserved bits of the message) can be extended to include a subnet ID. The subnet ID (e.g., 8 bits) identifies the subnet that includes the NIC that communicated the message. A subnet consists of a set of ports where each port has one or more NICs such that all NICs in a given subnet are unique. In clusters that consist of multiple subnets, the fabric manager will configure each subnet with a separate subnet ID. This can also be applied to examples where each NIC has multiple ports and each port is assigned to a different subnet.

In an example, an initiator node “advertises” a list of NICs by which it exposes an application buffer. This list of NICs is used by the target when it issues GET requests. To ensure correctness, the target must build a list of NICs by performing a match of the subnet ID of each of its local NICs with the ones provided by the initiator. Once the pairs of NICs have been identified, the target can issue GET requests over multiple pairs of NICs. Each request identifies a different offset into the initiator side buffer.
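The subnet matching step above can be sketched as follows. Both function names and the dictionary representation (NIC identifier mapped to subnet ID) are assumptions for illustration; when several advertised NICs share a subnet, this sketch simply takes the first one.

```python
def pair_nics_by_subnet(local_nics: dict, advertised_nics: dict) -> list:
    """Pair each local NIC with an advertised initiator NIC on the same subnet.

    Both arguments map a NIC identifier to its subnet ID (hypothetical layout).
    """
    by_subnet = {}
    for nic, subnet in advertised_nics.items():
        by_subnet.setdefault(subnet, []).append(nic)
    pairs = []
    for nic, subnet in local_nics.items():
        remotes = by_subnet.get(subnet)
        if remotes:
            pairs.append((nic, remotes[0]))
    return pairs


def plan_gets(pairs: list, message_len: int) -> list:
    """Assign each NIC pair a distinct offset into the initiator side buffer."""
    if not pairs:
        return []
    chunk = -(-message_len // len(pairs))  # ceiling division
    plan = []
    for i, (local, remote) in enumerate(pairs):
        offset = i * chunk
        length = min(chunk, message_len - offset)
        if length > 0:
            plan.append((local, remote, offset, length))
    return plan


pairs = pair_nics_by_subnet({"110g": 1, "110h": 2}, {"110a": 1, "110b": 2})
print(plan_gets(pairs, 100))
```

A local NIC with no advertised counterpart on its subnet is left out of the plan, so no GET request is ever issued across subnets.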

The target must wait for every message chunk to complete successfully before it can mark a message complete and hand it off to the application. In cases where one or more chunks fail, failure events delivered by the NIC must be examined to see if a message can be retried on a different NIC. To ensure correctness when another NIC is used, the target must compare subnet IDs to make sure it issues GET requests to the correct NICs on the initiator side.

In the case where response packets are delayed in the network (e.g., congestion or fabric failures), a failure is reported to the application as a result of a timeout. Since the NIC places incoming data directly in the destination buffer, it is possible for these delayed packets to arrive on a different NIC or subnet plane after the retried message is completed successfully and the buffer is handed off to the application. This leads to data corruption. To handle such scenarios, the target must disable the NIC that was used to issue the GET request. Once disabled, stale packets that arrive at the NIC are dropped and a disable message (e.g., a PTL_NI_PT_DISABLED event) is sent back to the initiator of the packets to disable the NIC. Before re-enabling the NIC, the communication engine must perform a recovery operation where a global barrier is sent to all nodes in the job to ensure they stop sending packets. It must be noted that disabling the NIC and continuing with the other NICs in the network element can be done relatively quickly and does not need the recovery procedure to be completed or even started. Therefore, such a process allows the application to make further progress in the case of failures. If such a resiliency scheme is not implemented, the application must be aborted and restarted.
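The stale-packet protection described above can be modeled with a toy state machine. This is a software simulation under stated assumptions, not a Portals API call: the class `TargetNic`, its methods, and the use of the string "PTL_NI_PT_DISABLED" as a return label are all illustrative.

```python
class TargetNic:
    """Toy model of a NIC that drops packets once disabled (illustrative)."""

    def __init__(self, nic_id: str):
        self.nic_id = nic_id
        self.enabled = True
        self.buffer = {}  # offset -> data placed directly by the NIC

    def disable(self):
        self.enabled = False

    def deliver(self, offset: int, data: bytes) -> str:
        if not self.enabled:
            # Stale packet: dropped, and the sender is told the NIC is disabled.
            return "PTL_NI_PT_DISABLED"
        self.buffer[offset] = data
        return "OK"

    def reenable(self, barrier_complete: bool) -> bool:
        # Re-enable only after the global barrier confirms senders have stopped.
        if barrier_complete:
            self.enabled = True
        return self.enabled


nic = TargetNic("110g")
print(nic.deliver(0, b"fresh"))
nic.disable()
print(nic.deliver(8, b"stale"))
```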

Turning to the infrastructure of FIG. 1, system 100 in accordance with an example embodiment is shown. Generally, system 100 may be implemented in any type or topology of networks. Network 114 represents a series of points or nodes of interconnected communication paths for receiving and transmitting packets of information that propagate through system 100. Network 114 offers a communicative interface between nodes, and may be configured as any local area network (LAN), virtual local area network (VLAN), wide area network (WAN), wireless local area network (WLAN), metropolitan area network (MAN), Intranet, Extranet, virtual private network (VPN), and any other appropriate architecture or system that facilitates communications in a network environment, or any suitable combination thereof, including wired and/or wireless communication.

In system 100, network traffic, which is inclusive of packets, frames, signals, data, etc., can be sent and received according to any suitable communication messaging protocols. Suitable communication messaging protocols can include a multi-layered scheme such as MPI, Open Systems Interconnection (OSI) model, or any derivations or variants thereof (e.g., Transmission Control Protocol/Internet Protocol (TCP/IP), user datagram protocol/IP (UDP/IP)). Messages through the network could be made in accordance with various network protocols (e.g., Ethernet, MPI, Infiniband, OmniPath, etc.). Additionally, radio signal communications over a cellular network may also be provided in system 100. Suitable interfaces and infrastructure may be provided to enable communication with the cellular network.

The term “packet” as used herein, refers to a unit of data that can be routed between a source node and a destination node on a packet switched network. A packet includes a source network address and a destination network address. These network addresses can be Internet Protocol (IP) addresses in a TCP/IP messaging protocol. The term “data” as used herein, refers to any type of binary, numeric, voice, video, textual, or script data, or any type of source or object code, or any other suitable information in any appropriate format that may be communicated from one point to another in electronic devices and/or networks. Additionally, messages, requests, responses, and queries are forms of network traffic, and therefore, may comprise packets, frames, signals, data, etc.

In an example implementation, network elements 102 a-102 f are meant to encompass network elements, network appliances, servers, routers, switches, gateways, bridges, load balancers, processors, modules, or any other suitable device, component, element, or object operable to exchange information in a network environment. Network elements 102 a-102 f may include any suitable hardware, software, components, modules, or objects that facilitate the operations thereof, as well as suitable interfaces for receiving, transmitting, and/or otherwise communicating data or information in a network environment. This may be inclusive of appropriate algorithms and communication protocols that allow for the effective exchange of data or information. Each of network elements 102 a-102 f may be virtual or include virtual elements.

In regard to the internal structure associated with system 100, each of network elements 102 a-102 f can include memory elements for storing information to be used in the operations outlined herein. Each of network elements 102 a-102 f may keep information in any suitable memory element (e.g., random access memory (RAM), read-only memory (ROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), application specific integrated circuit (ASIC), etc.), software, hardware, firmware, or in any other suitable component, device, element, or object where appropriate and based on particular needs. Any of the memory items discussed herein should be construed as being encompassed within the broad term ‘memory element.’ Moreover, the information being used, tracked, sent, or received in system 100 could be provided in any database, register, queue, table, cache, control list, or other storage structure, all of which can be referenced at any suitable timeframe. Any such storage options may also be included within the broad term ‘memory element’ as used herein.

In certain example implementations, the functions outlined herein may be implemented by logic encoded in one or more tangible media (e.g., embedded logic provided in an ASIC, digital signal processor (DSP) instructions, software (potentially inclusive of object code and source code) to be executed by a processor, or other similar machine, etc.), which may be inclusive of non-transitory computer-readable media. In some of these instances, memory elements can store data used for the operations described herein. This includes the memory elements being able to store software, logic, code, or processor instructions that are executed to carry out the activities described herein.

In an example implementation, elements of system 100, such as network elements 102 a-102 f, may include software modules (e.g., communication engine 108) to achieve, or to foster, operations as outlined herein. These modules may be suitably combined in any appropriate manner, which may be based on particular configuration and/or provisioning needs. In example embodiments, such operations may be carried out by hardware, implemented externally to these elements, or included in some other network device to achieve the intended functionality. Furthermore, the modules can be implemented as software, hardware, firmware, or any suitable combination thereof. These elements may also include software (or reciprocating software) that can coordinate with other network elements in order to achieve the operations, as outlined herein.

Additionally, each of network elements 102 a-102 f may include a processor that can execute software or an algorithm to perform activities as discussed herein. A processor can execute any type of instructions associated with the data to achieve the operations detailed herein. In one example, the processors could transform an element or an article (e.g., data) from one state or thing to another state or thing. In another example, the activities outlined herein may be implemented with fixed logic or programmable logic (e.g., software/computer instructions executed by a processor) and the elements identified herein could be some type of a programmable processor, programmable digital logic (e.g., a field programmable gate array (FPGA), an erasable programmable read-only memory (EPROM), EEPROM) or an ASIC that includes digital logic, software, code, electronic instructions, or any suitable combination thereof. Any of the potential processing elements, modules, and machines described herein should be construed as being encompassed within the broad term ‘processor.’

Turning to FIG. 2A, FIG. 2A is a simplified block diagram of an example of communications in system 100. In an example, network element 102 a can include memory 104, application 106 a, communication engine 108, and NICs 110 a and 110 b. Memory 104 a can include NIC table 120. Network element 102 b can include memory 104, application 106 b, communication engine 108, and NICs 110 d and 110 e. Memory 104 b can include NIC table 120. Network element 102 c can include memory 104, application 106 c, communication engine 108, and NICs 110 g and 110 h. Memory 104 c can include NIC table 120. Network element 102 d can include memory 104, application 106 d, communication engine 108, and NICs 110 j and 110 k. Memory 104 d can include NIC table 120.

NIC 110 a in network element 102 a can be in communication with first level switch 122 a using first level subnet path 124 a and NIC 110 d in network element 102 b can be in communication with first level switch 122 a using first level subnet path 124 b. First level switch 122 a can be in communication with network 114 using first subnet path 126 a. In addition, NIC 110 g in network element 102 c can be in communication with first level switch 122 c using first level subnet path 124 c and NIC 110 j in network element 102 d can be in communication with first level switch 122 c using first level subnet path 124 d. First level switch 122 c can be in communication with network 114 using first subnet path 126 b. NICs 110 a, 110 d, 110 g, and 110 j and first level switches 122 a and 122 c are on the same first subnet.

NIC 110 b in network element 102 a can be in communication with first level switch 122 b using first level subnet path 128 a and NIC 110 e in network element 102 b can be in communication with first level switch 122 b using first level subnet path 128 b. First level switch 122 b can be in communication with network 114 using second subnet path 130 a. In addition, NIC 110 h in network element 102 c can be in communication with first level switch 122 d using first level subnet path 128 c and NIC 110 k in network element 102 d can be in communication with first level switch 122 d using first level subnet path 128 d. First level switch 122 d can be in communication with network 114 using second subnet path 130 b. NICs 110 b, 110 e, 110 h, and 110 k and first level switches 122 b and 122 d are on the same second subnet.

In an example, application 106 a can communicate with application 106 c using NIC 110 a and NIC 110 g. In one direction, the communication would travel from NIC 110 a to first level switch 122 a on first level subnet path 124 a, from first level switch 122 a to network 114 on first subnet path 126 a, from network 114 to first level switch 122 c on first subnet path 126 b, and from first level switch 122 c to NIC 110 g on first level subnet path 124 c. In the opposite direction, the communication would travel from NIC 110 g to first level switch 122 c on first level subnet path 124 c, from first level switch 122 c to network 114 on first subnet path 126 b, from network 114 to first level switch 122 a on first subnet path 126 a, and from first level switch 122 a to NIC 110 a on first level subnet path 124 a. Issues with the communication path from first level switch 122 a, through network 114, to first level switch 122 c are left for the network architecture to handle. However, the network architecture cannot handle issues from network element 102 a to first level switch 122 a or from network element 102 c to first level switch 122 c.

Turning to FIG. 2B, FIG. 2B is a simplified block diagram of an example of communications in system 100. As illustrated in FIG. 2B, first level subnet path 124 c has been compromised and is preventing communication between NIC 110 g and first level switch 122 c. As a result, network element 102 c and application 106 c can no longer communicate with network element 102 a and application 106 a on the first subnet associated with first level subnet path 124 c.

In an example, communication engine 108 a can receive a failure notification from NIC 110 a that communications with NIC 110 g have failed. Communication engine 108 a can analyze NIC table 120 and determine that NIC 110 g is located on network element 102 c and that local NIC 110 b is in communication with NIC 110 h, which is also located on network element 102 c. Communication engine 108 a can send an out of band message to communication engine 108 c using NIC 110 b, first level subnet path 128 a, first level switch 122 b, second subnet path 130 a, network 114, second subnet path 130 b, first level switch 122 d, first level subnet path 128 c, and NIC 110 h.

Communication engine 108 c can receive the out of band message and disable NIC 110 g. Also, communication engine 108 c can resend any data that was not delivered or for which an acknowledgement was not received. In addition, communication engine 108 c can change the entry for NIC 110 g in NIC table 120 to indicate that NIC 110 g is disabled. The updated NIC table 120 can be communicated to the network elements in system 100.

In another example, communication engine 108 c can receive a failure notification from NIC 110 g that communications with NIC 110 a have failed. Communication engine 108 c can analyze NIC table 120 and determine that NIC 110 a is located on network element 102 a and that local NIC 110 h is in communication with NIC 110 b, which is also located on network element 102 a. Communication engine 108 c can send an out of band message to communication engine 108 a using NIC 110 h, first level subnet path 128 c, first level switch 122 d, second subnet path 130 b, network 114, second subnet path 130 a, first level switch 122 b, first level subnet path 128 a, and NIC 110 b.

Communication engine 108 a can receive the out of band message and disable NIC 110 a. In addition, communication engine 108 a can change the entry for NIC 110 g in NIC table 120 to indicate that NIC 110 g is disabled. The updated NIC table 120 can be communicated to the network elements in system 100.

Turning to FIG. 2C, FIG. 2C is a simplified block diagram of an example of communications in system 100. As illustrated in FIG. 2C, first level subnet path 124 a has been compromised and is preventing communication between NIC 110 a and first level switch 122 a. As a result, network element 102 a and application 106 a can no longer communicate with network element 102 c and application 106 c on the first subnet associated with first level subnet path 124 a.

In an example, communication engine 108 c can receive a failure notification from NIC 110 g that communications with NIC 110 a have failed. Communication engine 108 c can analyze NIC table 120 and determine that NIC 110 a is located on network element 102 a and that local NIC 110 h is in communication with NIC 110 b, which is also located on network element 102 a. Communication engine 108 c can send an out of band message to communication engine 108 a using NIC 110 h, first level subnet path 128 c, first level switch 122 d, second subnet path 130 b, network 114, second subnet path 130 a, first level switch 122 b, first level subnet path 128 a, and NIC 110 b.

Communication engine 108 a can receive the out of band message and disable NIC 110 b. Also, communication engine 108 a can resend any data that was not delivered or for which an acknowledgement was not received. In addition, communication engine 108 a can change the entry for NIC 110 b in NIC table 120 to indicate that NIC 110 b is disabled. The updated NIC table 120 can be communicated to the network elements in system 100.

In another example, communication engine 108 a can receive a failure notification from NIC 110 b that communications with NIC 110 g have failed. Communication engine 108 a can analyze NIC table 120 and determine that NIC 110 g is located on network element 102 c and that local NIC 110 b is in communication with NIC 110 h, which is also located on network element 102 c. Communication engine 108 a can send an out of band message to communication engine 108 c using NIC 110 b, first level subnet path 128 a, first level switch 122 b, second subnet path 130 a, network 114, second subnet path 130 b, first level switch 122 d, first level subnet path 128 c, and NIC 110 h.

Communication engine 108 c can receive the out of band message and disable NIC 110 g. In addition, communication engine 108 c can change the entry for NIC 110 b in NIC table 120 to indicate that NIC 110 b is disabled. The updated NIC table 120 can be communicated to the network elements in system 100.

Turning to FIG. 3, FIG. 3 is a simplified block diagram of an example of NIC table 120 for use in communications in system 100. As illustrated in FIG. 3, NIC table 120 can include a NIC column 134, a subnet column 136, a node column 138, and an active column 140. NIC column 134 can include an identifier of NICs in system 100. Subnet column 136 can indicate what subnet is associated with each NIC in NIC column 134. Node column 138 can indicate what node or network element is associated with each NIC in NIC column 134. Active column 140 can indicate if the NIC in NIC column 134 is active on the subnet associated with the NIC.
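As a rough illustration of how a table with these four columns could support the failover lookup described for FIG. 2B, the sketch below finds an alternate NIC on the same network element as a failed NIC and the local NIC that shares its subnet. The `NicEntry` record, the function names, the string identifiers, and the numeric subnet IDs are assumptions for this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NicEntry:
    nic: str      # NIC column 134
    subnet: int   # subnet column 136
    node: str     # node column 138
    active: bool  # active column 140


def find_alternate(table: List[NicEntry], failed_nic: str) -> Optional[NicEntry]:
    """Find an active NIC on the same node as failed_nic but on another subnet."""
    failed = next((e for e in table if e.nic == failed_nic), None)
    if failed is None:
        return None
    for e in table:
        if e.node == failed.node and e.subnet != failed.subnet and e.active:
            return e
    return None


def local_nic_on_subnet(table: List[NicEntry], node: str, subnet: int) -> Optional[NicEntry]:
    """Find an active NIC on the given node that sits on the given subnet."""
    for e in table:
        if e.node == node and e.subnet == subnet and e.active:
            return e
    return None


table = [
    NicEntry("110a", 1, "102a", True), NicEntry("110b", 2, "102a", True),
    NicEntry("110g", 1, "102c", True), NicEntry("110h", 2, "102c", True),
]
alt = find_alternate(table, "110g")
via = local_nic_on_subnet(table, "102a", alt.subnet)
print(f"reach {alt.node} through {via.nic} -> {alt.nic}")
```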

Turning to FIG. 4, FIG. 4 is a simplified block diagram illustrating example details of a message 142 for use in system 100, in accordance with an embodiment of the present disclosure. Message 142 can be used in communications between network elements. For example, message 142 can be used for a PUT command, a GET command, when communicating a message that includes an amount of data that is less than an MTU, when communicating a message that includes an amount of data that is more than an MTU threshold, etc. Message 142 can include a header portion 144, an envelope portion 146, and a payload portion 148. Header portion 144 can include addressing and other data that is required for message 142 to reach its intended destination. Header portion 144 can also include offset data and MU related data. In an example, envelope portion 146 is part of header portion 144.

Envelope portion 146 can include data to help distinguish various messages and allow a network element to selectively receive message 142. Envelope portion 146 is typically implementation dependent and can include a message tag, communicator, source, destination, message length, tag match bits, message ID, and other implementation specific information. In an example, envelope portion 146 may include communicator data. The communicator data can include a handle to a group or ordered set of processes and can specify a communication context. Payload portion 148 can include the payload data of message 142. In an example, payload portion 148 can include a manifest.

Turning to FIG. 5, FIG. 5 is a simplified block diagram illustrating example details of header portion 144 for use in system 100, in accordance with an embodiment of the present disclosure. Header portion 144 can include an offset portion 150 and an MU data portion 152. MU data portion 152 can include an MU enabled portion 154, a NIC portion 156, a message length portion 158, and a unique message portion 160. Offset portion 150 includes the offset in the allocated memory region from which to start pulling data for the response to a GET command. In an example, MU data portion 152 may be a 64-bit field.

MU enabled portion 154 may be related to an MU parameter. For example, a value of zero in MU enabled portion 154 can indicate that the MU protocol is disabled or not chosen. A value of one in MU enabled portion 154 can indicate that, in the initiator to receiver direction, the MU is enabled. A value of two in MU enabled portion 154 can indicate that, in the receiver to initiator direction, the MU is enabled and may indicate a cleanup. A value of three in MU enabled portion 154 can be reserved. In a specific example, MU enabled portion 154 may be a two-bit integer.

NIC portion 156 can indicate NIC related data. For example, a zero value in NIC portion 156 can be undefined. A one value in NIC portion 156 can be initiator specific and specify how many NICs an initiator has available for the receiver to pull the chunks of the message. A two value in NIC portion 156 can be receiver specific and specify how many of the initiator side NICs the receiver will use to pull the chunks of the message. This can also be used to provide the initiator with a count of how many GET commands to expect from the receiver before cleaning up allocated regions of memory. A three value in NIC portion 156 can be reserved. In a specific example, NIC portion 156 may be a three-bit integer.

Message length portion 158 can indicate the size of the message length. For example, a zero value in message length portion 158 can be undefined. A one value in the message length portion 158 can indicate the length of the initiator's allocated regions of memory. A two value in message length portion 158 can indicate the possible length of the pull request across the NICs. A three value in message length portion 158 can be reserved. In a specific example, message length portion 158 may be a thirty-two bit integer.

In a PUT command, unique message portion 160 can indicate the tag matching or message ID for the communication related to message 142. In a responding GET command from the receiver, the message ID is used as the tag matching bits and is included in the envelope portion of the message. In an example, a zero value in unique message portion 160 can be undefined. A one value in unique message portion 160 can indicate that the initiator generated a unique message ID for a given multi-unit transaction. A two value in unique message portion 160 can be used by the receiver to echo the message ID field from the initiator. A three value in the unique message portion 160 can be reserved. In a specific example, unique message portion 160 may be a twenty-eight bit integer.
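The four sub-fields of MU data portion 152 can be illustrated with a simple pack and unpack pair. The field order, the names in `FIELDS`, and the functions `pack_mu` and `unpack_mu` are assumptions for this sketch. Note that the widths given in the specific examples above (2, 3, 32, and 28 bits) total 65 bits, so the sketch packs into an arbitrary-precision Python integer rather than assuming a particular container size.

```python
# (name, width in bits), most significant field first. Order is an assumption.
FIELDS = (
    ("mu_enabled", 2),
    ("nic_count", 3),
    ("message_length", 32),
    ("message_id", 28),
)


def pack_mu(values: dict) -> int:
    """Pack the MU sub-fields into one integer, checking each field's range."""
    word = 0
    for name, width in FIELDS:
        v = values[name]
        if not 0 <= v < (1 << width):
            raise ValueError(f"{name}={v} does not fit in {width} bits")
        word = (word << width) | v
    return word


def unpack_mu(word: int) -> dict:
    """Inverse of pack_mu: peel fields off starting from the least significant."""
    out = {}
    for name, width in reversed(FIELDS):
        out[name] = word & ((1 << width) - 1)
        word >>= width
    return out


hdr = {"mu_enabled": 1, "nic_count": 2, "message_length": 1 << 20, "message_id": 42}
assert unpack_mu(pack_mu(hdr)) == hdr
```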

Turning to FIG. 6A, FIG. 6A is a simplified block diagram illustrating example details of envelope portion 146 a for use in system 100, in accordance with an embodiment of the present disclosure. Envelope portion 146 a can include a source portion 162, a destination portion 164, and a tag match bits portion 166. Source portion 162 can help identify the source of message 142 and is implicitly determined by the identity of the source of message 142. The source of message 142 may be an initiator or a receiver (e.g., an initiator that is the source of a PUT command message or a receiver that is the source of a GET command message). Destination portion 164 can help identify the destination of message 142 and is specified by a dest argument in MPI. The destination may be an initiator or a receiver. Tag match bits portion 166 can help provide a mechanism for distinguishing between different messages. In MPI, a sending process must associate a tag with a message to help identify the message.

Turning to FIG. 6B, FIG. 6B is a simplified block diagram illustrating example details of envelope portion 146 b for use in system 100, in accordance with an embodiment of the present disclosure. In an initial PUT command from the initiator, the tag matching bits are provided by the initiator (e.g., communication engine 108), and are matched against the tag match bits of receives at the receiver. When either rendezvous or MU is in use, the tag matching bits in the GET command(s) are set to the message ID. The receiver obtains the message ID from the header data of the original PUT from the initiator. Envelope portion 146 b can include source portion 162, destination portion 164, and a message ID portion 168. Message ID portion 168 can include the message ID that will allow the initiator to identify the exposed region of memory on the initiator where the data for the message is located. In addition, the receiver can use the message ID (and the NIC where data was received) to determine the allocated memory region on the receiver where a chunk of data should be placed.

Turning to FIG. 7, FIG. 7 is a simplified block diagram illustrating example details of payload portion 148 for use in system 100, in accordance with an embodiment of the present disclosure. Payload portion 148 can include a manifest 170. Manifest 170 can include NIC data 172 a-172 c and a subnet ID 174. NIC data 172 a-172 c can include an identification of each specific NIC on the initiator that a receiver can use to pull data from the initiator.

Subnet ID 174 identifies the subnet that includes the NIC that communicated the message. A subnet consists of a set of ports where each port has one or more NICs such that all NICs in a given subnet are unique. In clusters that consist of multiple planes (subnets), the fabric manager will configure each plane (subnet) with a separate subnet ID. This can also be applied to examples where each NIC has multiple ports and each port is assigned to a different subnet.
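In an illustrative, non-limiting example, a manifest that carries NIC data and a subnet ID can be modeled as a small record. The Python sketch below is one possible representation; the class name, the field names, and the use of integer identifiers are assumptions and are not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Manifest:
    """Hypothetical in-memory form of manifest 170."""
    subnet_id: int            # corresponds to subnet ID 174
    nic_ids: Tuple[int, ...]  # corresponds to NIC data 172 a-172 c


def build_manifest(subnet_id: int, nic_ids: Iterable[int]) -> Manifest:
    """Create a manifest, checking that NIC identifiers are unique.

    The uniqueness check mirrors the statement that all NICs in a
    given subnet are unique.
    """
    ids = tuple(nic_ids)
    if not ids:
        raise ValueError("a manifest needs at least one NIC")
    if len(set(ids)) != len(ids):
        raise ValueError("NIC identifiers must be unique within a subnet")
    return Manifest(subnet_id=subnet_id, nic_ids=ids)
```

For instance, `build_manifest(7, [1, 2, 3])` describes three initiator NICs on a subnet whose (hypothetical) ID is 7.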

Turning to FIG. 8, FIG. 8 is an example flowchart illustrating possible operations of a flow 800 that may be associated with the communication of a message using a NIC on a subnet, in accordance with an embodiment. In an embodiment, one or more operations of flow 800 may be performed by communication engine 108. At 802, a message to be communicated from an initiator to a receiver on a first subnet is determined. At 804, the system determines if the message satisfies a threshold. For example, the threshold may be an MTU. If the message does not satisfy the threshold, then the message is sent using a standard protocol, as in 806. For example, if the message is small, the message can be sent using an MPI PUT command because the entire message fits in a single packet and is delivered to the receiver. If the message is large (e.g., larger than an MTU and MU is not enabled), then the message may be sent using an RTS/CTS mechanism.

If the message does satisfy the threshold, then a plurality of NICs available to communicate the message are determined, as in 808. At 810, an area of memory that includes the data for the message is determined. At 812, the determined area of memory that includes the data for the message is exposed to each of the plurality of NICs. At 814, a manifest that identifies the plurality of NICs and the subnet is created. At 816, the manifest is communicated to the receiver. In an example, an ME is posted on each of the determined plurality of NICs, where the ME points to the exposed area of memory that includes the data for the message. The ME is associated with a message ID and the message ID can be sent to the receiver in a PUT command. When an initiator receives a GET command at a NIC from a receiver, the GET command will include the message ID and the message ID can be used to determine the ME and the exposed area of memory that includes the data to be communicated to the receiver.
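In an illustrative, non-limiting example, the relationship between an ME, a message ID, and the exposed area of memory can be sketched as a per-NIC lookup table. The Python model below is a toy under stated assumptions: the class and method names are hypothetical, each NIC's ME list is modeled as a dictionary, and a GET command is reduced to a NIC ID, a message ID, an offset, and a length.

```python
import itertools


class InitiatorNics:
    """Toy model of posting one ME per NIC for a single message."""

    def __init__(self, nic_ids):
        # One table per NIC, mapping message ID -> exposed memory region.
        self._me_tables = {nic_id: {} for nic_id in nic_ids}
        self._next_message_id = itertools.count(1)

    def expose(self, data: bytes) -> int:
        """Expose the same memory area on every NIC; return the message ID."""
        message_id = next(self._next_message_id)
        region = memoryview(data)  # all NICs reference the same buffer
        for table in self._me_tables.values():
            table[message_id] = region
        return message_id

    def handle_get(self, nic_id: int, message_id: int,
                   offset: int, length: int) -> bytes:
        """Resolve a GET that arrived at nic_id to the exposed data."""
        region = self._me_tables[nic_id][message_id]
        return bytes(region[offset:offset + length])
```

In this sketch a receiver could pull different chunks of the same message through different NICs, because every NIC resolves the message ID to the same exposed buffer.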

Turning to FIG. 9, FIG. 9 is an example flowchart illustrating possible operations of a flow 900 that may be associated with the communication of a message using a NIC on a subnet, in accordance with an embodiment. In an embodiment, one or more operations of flow 900 may be performed by communication engine 108. At 902, a message to be communicated from an initiator to a receiver on a first subnet is determined. At 904, the system determines if the size of the message satisfies a threshold. For example, the threshold may be an MTU threshold. If the size of the message does not satisfy the threshold, then the message is sent using a standard protocol, as in 906. For example, if the message is small, the message can be sent using an MPI PUT command because the entire message fits in a single packet and is delivered to the receiver. If the message is large (e.g., larger than an MTU and MU is not enabled), then the message may be broken up into chunks and each chunk can be sent using an RTS/CTS mechanism.

If the size of the message does satisfy the threshold, then a plurality of NICs available to communicate the message are determined, as in 908. At 910, a manifest is created that includes an identification of the determined NICs available to communicate the message and a subnet ID that identifies the first subnet. At 912, the area of memory that includes the data for the message is exposed to each of the determined NICs available to communicate the message. At 914, an MU message is created and the manifest is added to the MU message. In an example, a message ID is also added to the MU message (e.g., the message ID is added to the envelope portion of an MU message). At 916, the MU message is communicated to the receiver using a PUT command.
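In an illustrative, non-limiting example, the decision at 904 and the assembly of the MU message at 910-914 can be sketched in Python. The sketch makes several assumptions that are not stated in the disclosure: a message satisfies the threshold when it is strictly larger than it, the threshold value is a placeholder, all sub-threshold paths are collapsed into a single "standard" result, and the dictionary keys are hypothetical names for the envelope, header, and payload portions.

```python
from typing import Any, Dict, List

MTU_BYTES = 4096  # placeholder threshold; the real MTU is fabric specific


def build_put(data: bytes, available_nics: List[int], subnet_id: int,
              message_id: int, threshold: int = MTU_BYTES) -> Dict[str, Any]:
    """Return a description of the PUT that would carry this message."""
    # 904/906: small messages take the standard path (assumed comparison).
    if len(data) <= threshold:
        return {"kind": "standard", "payload": data}

    # 910: the manifest lists the usable NICs and names the first subnet.
    manifest = {"nics": list(available_nics), "subnet_id": subnet_id}

    # 914: the MU message carries the message ID in its envelope and the
    # manifest in its payload; the data itself stays in exposed memory.
    return {
        "kind": "MU",
        "envelope": {"message_id": message_id},
        "header": {"message_length": len(data)},
        "payload": {"manifest": manifest},
    }
```

Note that in the MU case the PUT carries only the manifest and identifiers; the receiver is expected to pull the data itself with GET commands.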

Turning to FIG. 10, FIG. 10 is an example flowchart illustrating possible operations of a flow 1000 that may be associated with the communication of a message using a NIC on a subnet, in accordance with an embodiment. In an embodiment, one or more operations of flow 1000 may be performed by communication engine 108. At 1002, data is communicated on a first subnet from a first NIC on a first network element to a second NIC on a second network element. At 1004, a connection between the first NIC and a first subnet switch in communication with the first NIC fails. At 1006, a third NIC on the second network element sends a message over a second subnet to a fourth NIC on the first network element to disable the first NIC. At 1008, the first NIC is flagged as disabled in a NIC table.
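In an illustrative, non-limiting example, the NIC table referenced at 1008 can be sketched as a mapping from NIC ID to a record holding the subnet, the network element, and an active indicator. The Python names below are hypothetical, and the `accept_packet` check is an assumed way of dropping stale packets that arrive at a disabled NIC.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class NicEntry:
    nic_id: int
    subnet_id: int
    element_id: str      # network element that includes this NIC
    active: bool = True  # indicator of whether the NIC is active


class NicTable:
    """Toy NIC table keyed by NIC ID."""

    def __init__(self, entries: Iterable[NicEntry]):
        self._entries: Dict[int, NicEntry] = {e.nic_id: e for e in entries}

    def disable(self, nic_id: int) -> None:
        """Flag a NIC as disabled, as in 1008."""
        self._entries[nic_id].active = False

    def is_active(self, nic_id: int) -> bool:
        return self._entries[nic_id].active

    def accept_packet(self, nic_id: int) -> bool:
        """Return False for packets addressed to a disabled NIC."""
        return self.is_active(nic_id)
```

In this sketch, the disable request received over the second subnet would translate into a call such as `table.disable(first_nic_id)` on the first network element.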

Turning to FIG. 11, FIG. 11 is an example flowchart illustrating possible operations of a flow 1100 that may be associated with the communication of a message using a NIC on a subnet, in accordance with an embodiment. In an embodiment, one or more operations of flow 1100 may be performed by communication engine 108. At 1102, a first network element with a first NIC attempts to communicate with a second network element with a second NIC, where the communication is over a first subnet. At 1104, the attempted communication triggers an error. At 1106, the system determines if the error is a fatal error. If the error is not a fatal error, then the system returns to 1102 where the first network element with the first NIC attempts to communicate with the second network element with the second NIC, where the communication is over the first subnet.

If the error is a fatal error, then the system determines if the second network element has a third NIC on a different subnet that can be used for the communication, as in 1108. If the second network element has a third NIC on a different subnet that can be used for the communication, then the second NIC is disabled and the communication is attempted on the third NIC using the different subnet, as in 1110. In an example, the second NIC is marked as disabled in a NIC table. If the second network element does not have a third NIC on a different subnet that can be used for the communication, then the communication cannot be established, as in 1112.
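In an illustrative, non-limiting example, flow 1100 can be sketched as a retry loop with a failover step. The Python below is a sketch under stated assumptions: the exception classes, the `send` callable, and the dictionary form of the NIC table are hypothetical, and a retry limit is added so the example terminates even though the flow itself does not specify one.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


class TransientError(Exception):
    """Hypothetical non-fatal error; the attempt may simply be repeated."""


class FatalError(Exception):
    """Hypothetical fatal error; the target NIC is treated as unusable."""


@dataclass
class NicEntry:
    nic_id: int
    subnet_id: int
    element_id: str
    active: bool = True


def find_alternate(table: Dict[int, NicEntry],
                   failed_id: int) -> Optional[NicEntry]:
    """1108: look for an active NIC on the same element, different subnet."""
    failed = table[failed_id]
    for entry in table.values():
        if (entry.active and entry.element_id == failed.element_id
                and entry.subnet_id != failed.subnet_id):
            return entry
    return None


def communicate(table: Dict[int, NicEntry], nic_id: int,
                send: Callable[[int], None], max_retries: int = 3) -> int:
    """Return the ID of the NIC that finally carried the communication."""
    for _ in range(max_retries):
        try:
            send(nic_id)                 # 1102: attempt on the first subnet
            return nic_id
        except TransientError:
            continue                     # 1106: not fatal, return to 1102
        except FatalError:
            alternate = find_alternate(table, nic_id)
            if alternate is None:        # 1112: nothing to fail over to
                raise ConnectionError("no alternate NIC on a different subnet")
            table[nic_id].active = False  # 1110: disable the failed NIC
            send(alternate.nic_id)        # 1110: retry on the other subnet
            return alternate.nic_id
    raise ConnectionError("retry limit reached")
```

The failed NIC is marked inactive before the alternate is tried, so later lookups in the same table do not select it again.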

It is also important to note that the operations in the preceding flow diagrams (i.e., FIGS. 8-11) illustrate only some of the possible correlating scenarios and patterns that may be executed by, or within, system 100. Some of these operations may be deleted or removed where appropriate, or these operations may be modified or changed considerably without departing from the scope of the present disclosure. In addition, a number of these operations have been described as being executed concurrently with, or in parallel to, one or more additional operations. However, the timing of these operations may be altered considerably. The preceding operational flows have been offered for purposes of example and discussion. Substantial flexibility is provided by system 100 in that any suitable arrangements, chronologies, configurations, and timing mechanisms may be provided without departing from the teachings of the present disclosure.

Although the present disclosure has been described in detail with reference to particular arrangements and configurations, these example configurations and arrangements may be changed significantly without departing from the scope of the present disclosure. Moreover, certain components may be combined, separated, eliminated, or added based on particular needs and implementations. Additionally, although system 100 has been illustrated with reference to particular elements and operations that facilitate the communication process, these elements and operations may be replaced by any suitable architecture, protocols, and/or processes that achieve the intended functionality of system 100.

Numerous other changes, substitutions, variations, alterations, and modifications may be ascertained to one skilled in the art and it is intended that the present disclosure encompass all such changes, substitutions, variations, alterations, and modifications as falling within the scope of the appended claims. In order to assist the United States Patent and Trademark Office (USPTO) and, additionally, any readers of any patent issued on this application in interpreting the claims appended hereto, Applicant wishes to note that the Applicant: (a) does not intend any of the appended claims to invoke paragraph six (6) of 35 U.S.C. section 112 as it exists on the date of the filing hereof unless the words “means for” or “step for” are specifically used in the particular claims; and (b) does not intend, by any statement in the specification, to limit this disclosure in any way that is not otherwise reflected in the appended claims.

Other Notes and Examples

Example C1 is at least one machine readable storage medium having one or more instructions that when executed by at least one processor, cause the at least one processor to determine a network interface controller (NIC) in a first subnet to be used to communicate a message to a receiver, wherein the NIC is configured to comply with a message passing interface protocol, create a manifest that includes an identifier of the NIC and a subnet ID that identifies the first subnet, and communicate the manifest to the receiver.

In Example C2, the subject matter of Example C1 can optionally include where the one or more instructions further cause the at least one processor to add the subnet ID to a NIC table, wherein the NIC table includes an identification of a plurality of NICs, a subnet associated with each of the plurality of NICs, and an identification of a specific network element that includes a specific NIC.

In Example C3, the subject matter of any one of Examples C1-C2 can optionally include where the NIC table also includes an indicator when the specific NIC is active.

In Example C4, the subject matter of any one of Examples C1-C3 can optionally include where the one or more instructions further cause the at least one processor to determine that the NIC in the first subnet to be used to communicate the message is a non-operating NIC and is not operational, determine the network element that includes the non-operating NIC, determine an alternate NIC on a different subnet but on the same network element that includes the non-operating NIC, and communicate with the alternate NIC using the different subnet.

In Example C5, the subject matter of any one of Examples C1-C4 can optionally include where the communication with the alternate NIC is a request to disable the non-operating NIC.

In Example C6, the subject matter of any one of Examples C1-C5 can optionally include where the communication with the alternate NIC is a request to disable the non-operating NIC and once the non-operating NIC is disabled, stale packets that arrive at the non-operating NIC are dropped.

In Example C7, the subject matter of any one of Examples C1-C6 can optionally include where the NIC implements a Portals protocol.

In Example A1, a system can include a first network element, memory, a communication engine, and at least one processor. The communication engine is configured to cause the at least one processor to determine a network interface controller (NIC) in a first subnet to be used to communicate a message to a receiver, wherein the NIC is configured to comply with a message passing interface protocol, create a manifest that includes an identifier of the NIC and a subnet ID that identifies the first subnet, and communicate the manifest to the receiver.

In Example A2, the subject matter of Example A1 can optionally include where the communication engine is further configured to cause the at least one processor to add the subnet ID to a NIC table, wherein the NIC table includes an identification of a plurality of NICs, a subnet associated with each of the plurality of NICs, and an identification of a specific network element that includes a specific NIC.

In Example A3, the subject matter of any one of Examples A1-A2 can optionally include where the NIC table also includes an indicator when the specific NIC is active.

In Example A4, the subject matter of any one of Examples A1-A3 can optionally include where the communication engine is further configured to cause the at least one processor to determine that the NIC in the first subnet to be used to communicate the message is a non-operating NIC and is not operational, determine the network element that includes the non-operating NIC, determine an alternate NIC on a different subnet but on the same network element that includes the non-operating NIC, and communicate with the alternate NIC using the different subnet.

In Example A5, the subject matter of any one of Examples A1-A4 can optionally include where the communication with the alternate NIC is a request to disable the non-operating NIC.

Example M1 is a method including determining a network interface controller (NIC) in a first subnet to be used to communicate a message to a receiver, wherein the NIC is configured to comply with a message passing interface protocol, creating a manifest that includes an identifier of the NIC and a subnet ID that identifies the first subnet, and communicating the manifest to the receiver.

In Example M2, the subject matter of Example M1 can optionally include adding the subnet ID to a NIC table, wherein the NIC table includes an identification of a plurality of NICs, a subnet associated with each of the plurality of NICs, and an identification of a specific network element that includes a specific NIC.

In Example M3, the subject matter of any one of the Examples M1-M2 can optionally include where the NIC table also includes an indicator when the specific NIC is active on a specific subnet that is associated with the specific NIC.

In Example M4, the subject matter of any one of the Examples M1-M3 can optionally include determining that the NIC in the first subnet to be used to communicate the message is a non-operating NIC and is not operational, determining the network element that includes the non-operating NIC, determining an alternate NIC on a different subnet but on the same network element that includes the non-operating NIC, and communicating with the alternate NIC using the different subnet.

In Example M5, the subject matter of any one of the Examples M1-M4 can optionally include where the communication with the alternate NIC is a request to disable the non-operating NIC in the first subnet.

In Example M6, the subject matter of any one of Examples M1-M5 can optionally include where the message passing interface protocol is used to communicate the manifest to the receiver using an active NIC on a different subnet.

Example S1 is a system for communication of a message using a network interface controller (NIC) on a subnet, the system can include memory, one or more processors, and a communication engine. The communication engine can be configured to determine a network interface controller (NIC) in a first subnet to be used to communicate a message to a receiver, wherein the NIC is configured to comply with a message passing interface protocol, create a manifest that includes an identifier of the NIC and a subnet ID that identifies the first subnet, and communicate the manifest to the receiver.

In Example S2, the subject matter of Example S1 can optionally include where the communication engine is further configured to add the subnet ID to a NIC table, wherein the NIC table includes an identification of a plurality of NICs, a subnet associated with each of the plurality of NICs, and an identification of a specific network element that includes a specific NIC.

In Example S3, the subject matter of any one of the Examples S1-S2 can optionally include where the NIC table also includes an indicator when the specific NIC is active.

In Example S4, the subject matter of any one of the Examples S1-S3 can optionally include where the communication engine is further configured to determine that the NIC in the first subnet to be used to communicate the message is a non-operating NIC and is not operational, determine the network element that includes the non-operating NIC, determine an alternate NIC on a different subnet but on the same network element that includes the non-operating NIC, and communicate with the alternate NIC using the different subnet.

In Example S5, the subject matter of any one of the Examples S1-S4 can optionally include where the communication with the alternate NIC is a request to disable the non-operating NIC.

In Example S6, the subject matter of any one of the Examples S1-S5 can optionally include where the NIC implements a Portals protocol.

In Example S7, the subject matter of any one of the Examples S1-S6 can optionally include where the message passing interface protocol is used to communicate the manifest to the receiver.

Example AA1 is an apparatus including means for determining a network interface controller (NIC) in a first subnet to be used to communicate a message to a receiver, wherein the NIC is configured to comply with a message passing interface protocol, means for creating a manifest that includes an identifier of the NIC and a subnet ID that identifies the first subnet, and means for communicating the manifest to the receiver.

In Example AA2, the subject matter of Example AA1 can optionally include means for adding the subnet ID to a NIC table, wherein the NIC table includes an identification of a plurality of NICs, a subnet associated with each of the plurality of NICs, and an identification of a specific network element that includes a specific NIC.

In Example AA3, the subject matter of any one of Examples AA1-AA2 can optionally include where the NIC table also includes an indicator when the specific NIC is active.

In Example AA4, the subject matter of any one of Examples AA1-AA3 can optionally include means for determining that the NIC in the first subnet to be used to communicate the message is a non-operating NIC and is not operational, means for determining the network element that includes the non-operating NIC, means for determining an alternate NIC on a different subnet but on the same network element that includes the non-operating NIC, and means for communicating with the alternate NIC using the different subnet.

In Example AA5, the subject matter of any one of Examples AA1-AA4 can optionally include where the communication with the alternate NIC is a request to disable the non-operating NIC.

In Example AA6, the subject matter of any one of Examples AA1-AA5 can optionally include where the communication with the alternate NIC is a request to disable the non-operating NIC and once the non-operating NIC is disabled, stale packets that arrive at the non-operating NIC are dropped.

In Example AA7, the subject matter of any one of Examples AA1-AA6 can optionally include where the NIC implements a Portals protocol.

Example X1 is a machine-readable storage medium including machine-readable instructions to implement a method or realize an apparatus as in any one of the Examples A1-A5, AA1-AA7, or M1-M6. Example Y1 is an apparatus comprising means for performing any of the Example methods M1-M6. In Example Y2, the subject matter of Example Y1 can optionally include the means for performing the method comprising a processor and a memory. In Example Y3, the subject matter of Example Y2 can optionally include the memory comprising machine-readable instructions.

What is claimed is:
 1. At least one machine readable medium comprising one or more instructions that, when executed by at least one processor, cause the at least one processor to: determine a network interface controller (NIC) in a first subnet to be used to communicate a message to a receiver, wherein the NIC is configured to comply with a message passing interface protocol; create a manifest that includes an identifier of the NIC and a subnet ID that identifies the first subnet; and communicate the manifest to the receiver.
 2. The at least one machine readable medium of claim 1, wherein the one or more instructions further cause the at least one processor to: add the subnet ID to a NIC table, wherein the NIC table includes an identification of a plurality of NICs, a subnet associated with each of the plurality of NICs, and an identification of a specific network element that includes a specific NIC.
 3. The at least one machine readable medium of claim 2, wherein the NIC table also includes an indicator when the specific NIC is active.
 4. The at least one machine readable medium of claim 2, wherein the one or more instructions further cause the at least one processor to: determine that the NIC in the first subnet to be used to communicate the message is a non-operating NIC and is not operational; determine the network element that includes the non-operating NIC; determine an alternate NIC on a different subnet but on a same network element that includes the non-operating NIC; and communicate with the alternate NIC using the different subnet.
 5. The at least one machine readable medium of claim 4, wherein the communication with the alternate NIC is a request to disable the non-operating NIC.
 6. The at least one machine readable medium of claim 4, wherein the communication with the alternate NIC is a request to disable the non-operating NIC and once the non-operating NIC is disabled, stale packets that arrive at the non-operating NIC are dropped.
 7. The at least one machine readable medium of claim 1, wherein the NIC implements a Portals protocol.
 8. A system comprising: a first network element; memory; a communication engine; and at least one processor, wherein the communication engine is configured to cause the at least one processor to: determine a network interface controller (NIC) in a first subnet to be used to communicate a message to a receiver, wherein the NIC is configured to comply with a message passing interface protocol; create a manifest that includes an identifier of the NIC and a subnet ID that identifies the first subnet; and communicate the manifest to the receiver.
 9. The system of claim 8, wherein the communication engine is further configured to cause the at least one processor to: add the subnet ID to a NIC table, wherein the NIC table includes an identification of a plurality of NICs, a subnet associated with each of the plurality of NICs, and an identification of a specific network element that includes a specific NIC.
 10. The system of claim 9, wherein the NIC table also includes an indicator when the specific NIC is active.
 11. The system of claim 9, wherein the communication engine is further configured to cause the at least one processor to: determine that the NIC in the first subnet to be used to communicate the message is a non-operating NIC and is not operational; determine the network element that includes the non-operating NIC; determine an alternate NIC on a different subnet but on a same network element that includes the non-operating NIC; and communicate with the alternate NIC using the different subnet.
 12. The system of claim 11, wherein the communication with the alternate NIC is a request to disable the non-operating NIC.
 13. A method comprising: determining a network interface controller (NIC) in a first subnet to be used to communicate a message to a receiver, wherein the NIC is configured to comply with a message passing interface protocol; creating a manifest that includes an identifier of the NIC and a subnet ID that identifies the first subnet; and communicating the manifest to the receiver.
 14. The method of claim 13, further comprising: adding the subnet ID to a NIC table, wherein the NIC table includes an identification of a plurality of NICs, a subnet associated with each of the plurality of NICs, and an identification of a specific network element that includes a specific NIC.
 15. The method of claim 14, wherein the NIC table also includes an indicator when the specific NIC is active on a specific subnet that is associated with the specific NIC.
 16. The method of claim 14, further comprising: determining that the NIC in the first subnet to be used to communicate the message is a non-operating NIC and is not operational; determining the network element that includes the non-operating NIC; determining an alternate NIC on a different subnet but on a same network element that includes the non-operating NIC; and communicating with the alternate NIC using the different subnet.
 17. The method of claim 16, wherein the communication with the alternate NIC is a request to disable the non-operating NIC in the first subnet.
 18. The method of claim 13, wherein the message passing interface protocol is used to communicate the manifest to the receiver using an active NIC on a different subnet.
 19. A system for communication of a message using a network interface controller (NIC) on a subnet, the system comprising: memory; one or more processors; and a communication engine, wherein the communication engine is configured to: determine a network interface controller (NIC) in a first subnet to be used to communicate a message to a receiver, wherein the NIC is configured to comply with a message passing interface protocol; create a manifest that includes an identifier of the NIC and a subnet ID that identifies the first subnet; and communicate the manifest to the receiver.
 20. The system of claim 19, wherein the communication engine is further configured to: add the subnet ID to a NIC table, wherein the NIC table includes an identification of a plurality of NICs, a subnet associated with each of the plurality of NICs, and an identification of a specific network element that includes a specific NIC.
 21. The system of claim 20, wherein the NIC table also includes an indicator when the specific NIC is active.
 22. The system of claim 20, wherein the communication engine is further configured to: determine that the NIC in the first subnet to be used to communicate the message is a non-operating NIC and is not operational; determine the network element that includes the non-operating NIC; determine an alternate NIC on a different subnet but on a same network element that includes the non-operating NIC; and communicate with the alternate NIC using the different subnet.
 23. The system of claim 22, wherein the communication with the alternate NIC is a request to disable the non-operating NIC.
 24. The system of claim 19, wherein the NIC implements a Portals protocol.
 25. The system of claim 19, wherein the message passing interface protocol is used to communicate the manifest to the receiver.